Red Wine Quality by Pablo Rivero

INTRODUCTION

Here I explore and analyze a data set of 1,599 red wines. Each wine has 11 variables describing its chemical properties (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, and “alcohol”) plus a “quality” score. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (very excellent). The goal is to draw some conclusions about the relation between the chemical properties and the quality of the wines.

I load some libraries

I load the DataFile

Univariate Plots Section

What follows is a preliminary exploration of the dataset, aimed at understanding the structure of the individual variables. I first check the dimensions, the variables available, and the minimum, maximum, mean, and median of each one to get an initial sense of the data.
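The checks described above can be sketched in base R as follows. This is a minimal illustration, not the hidden chunk from this report: `reds_head` is a hypothetical name, and it holds only the first ten rows of four columns as printed by `str()` in the output, so its numbers differ from the full-data summary.

```r
# First ten rows of four columns, copied from the str() output in this report.
# The name reds_head is hypothetical; the full data frame is called reds.
reds_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

print(dim(reds_head))      # number of rows and columns
str(reds_head)             # type and first values of each variable
print(summary(reds_head))  # min, quartiles, median, mean, and max per variable
```

Run on the full data frame, the same three calls would produce output of the kind shown in this section.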

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Initial Observations

1- pH is between 2.74 and 4.01. Median is 3.31

2- Max quality value is 8. Min is 3. Median is 6

3- Alcohol is between 8.4% and 14.9%.

4- The residual sugar range is large, between 0.9 and 15.5 g/dm^3, but the median is only 2.2 g/dm^3

Let’s make some plots:

One of the main interests in this study is to understand the chemical properties that produce a great quality wine.

I decided to split quality into 3 ranges, (0,5], (5,6], and (6,10], treating 7 and 8 together as high quality (since only 18 wines have quality 8), and to check some chemical properties like pH and alcohol.

##  (0,5]  (5,6] (6,10] 
##    744    638    217
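A bucketing like this can be sketched with base R's `cut()`, whose default right-closed intervals yield exactly the labels shown above. The sketch assumes breaks at 0, 5, 6, and 10 (inferred from the labels, not confirmed by the hidden code) and uses only the first ten quality values printed earlier. The percentages at the end are computed from the counts reported in the table.

```r
# First ten quality ratings, copied from the str() output in this report.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed breaks: intervals are right-closed by default, e.g. (5,6] holds only 6.
quality.bucket <- cut(quality, breaks = c(0, 5, 6, 10))
bucket_counts <- table(quality.bucket)
print(bucket_counts)

# Share of each bucket in the full data, using the counts reported above.
full_counts <- c("(0,5]" = 744, "(5,6]" = 638, "(6,10]" = 217)
full_share <- round(100 * full_counts / sum(full_counts), 1)
print(full_share)
```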

Only 13.6% of wines (217 of 1,599) have a quality of 7 or more. Wines of every quality level have similar pH, so pH is not an important factor. The same holds for sulphates and density. However, almost no good wines (the (6,10] range) have less than 10% alcohol.

In the following plots I want to understand the spread of the data for some variables using the boxplot function.

There are some points that could be considered outliers at 14% alcohol or more, or below 9%. We should decide whether to include these data points when computing statistics. Most of the wines are between 9% and 11%.

Most of the wines have between ~1.7 and 3.4 g/dm^3 of residual sugar. In this case there is a large number of outliers, and the deviations are large. We should probably not consider wines above 8 g/dm^3.

Most of the wines have a pH between 3 and 3.6. The spread is not as large as for residual sugar. Some outliers lie at the extremes, near the minimum of 2.74 and the maximum of 4.01.
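To make "outlier" concrete, here is a small sketch of the conventional 1.5 × IQR rule that boxplot whiskers are usually based on. The helper `iqr_fences` is a hypothetical name, and the quartiles are the ones printed in the summary above. R's `boxplot()` works from hinges, which can differ slightly from these quartiles, so the limits are approximate.

```r
# Hypothetical helper: lower and upper limits beyond which a point is
# conventionally drawn as an outlier in a boxplot (k = 1.5 by default).
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles taken from the summary() output in this report.
ph_fences    <- iqr_fences(q1 = 3.21, q3 = 3.40)
sugar_fences <- iqr_fences(q1 = 1.9,  q3 = 2.6)

print(ph_fences)
print(sugar_fences)
```

With the full data available, `boxplot.stats(reds$residual.sugar)$out` would list the flagged values directly.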

What is the structure of your dataset?

The dataset consists of 1,599 observations of 13 variables (a row index X, 11 chemical properties, and quality).

What is/are the main feature(s) of interest in your dataset?

The quality of the wine

What other features in the dataset do you think will help support your

The relation between the chemical properties of the wine and its quality value.

Did you create any new variables from existing variables in the dataset?

Yes: quality.bucket, to group the wines by quality range.

Of the features you investigated, were there any unusual distributions?

Unusual distributions: citric.acid and free SO2.

Approximately normal distributions: density, pH, and volatile.acidity.

Skewed distributions, with a long right tail (mean above median in the summary): residual sugar, total SO2, alcohol, and sulphates.

Did you perform any operations on the data to tidy, adjust, or change the
form of the data? If so, why did you do this?

Yes. I created the reds_short subset to explore bivariate plots and correlations with fewer variables (next section).

Bivariate Plots Section

Let’s use ggpairs to look at bivariate plots and get an initial sense of the strongest correlations between variables in this dataset.

Notes about correlations:

1: There is a good correlation (0.668) between fixed acidity and density.

2: There is a good negative correlation (-0.683) between fixed acidity and pH.

3: There is a good correlation (0.668) between Total SO2 and free SO2.

4: There is a good correlation (0.672) between fixed acidity and citric.acid.

Let’s look at some of these relations with their regression lines.
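A correlation test and a regression line of this kind can be sketched with base R's `cor.test()` and `lm()`. The example below uses only the first ten rows printed by `str()` earlier (the name `reds_head` is hypothetical), so its estimate will not match the full-data values reported in this section.

```r
# First ten rows, copied from the str() output in this report.
reds_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pearson correlation with its t statistic, p-value, and confidence interval.
ct <- cor.test(reds_head$fixed.acidity, reds_head$pH, method = "pearson")
print(ct)

# Least-squares line pH = a + b * fixed.acidity.
fit <- lm(pH ~ fixed.acidity, data = reds_head)
print(coef(fit))
```

On a base scatter plot, `abline(fit)` would draw the fitted line over the points.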

## 
##  Pearson's product-moment correlation
## 
## data:  reds$fixed.acidity and reds$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

Density vs fixed acidity shows R = 0.668. There is still considerable dispersion, with some data points that could be removed (for example around density = 0.99, acidity = 8). But due to the large amount of data, those points would not change the value of R much.

## 
##  Pearson's product-moment correlation
## 
## data:  reds$fixed.acidity and reds$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

There is a good negative correlation (-0.683) between fixed acidity and pH.

## 
##  Pearson's product-moment correlation
## 
## data:  reds$free.sulfur.dioxide and reds$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

There is a good correlation (0.668) between Total SO2 and free SO2.

Let’s now focus on the chemical properties of a good wine in comparison with wines of other quality levels.

Alcohol increases with the quality of the wine

pH remains between 3.3 and 3.8 for wines of every quality level.

Chlorides show less dispersion as we move to high quality wines, staying close to 0.1.

It was interesting that just a few high quality wines had less than 10% alcohol, so I summarized alcohol by quality level to know more:

## reds$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## reds$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## reds$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## reds$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## reds$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## reds$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

While there are not many wines with more than 12% alcohol in the low quality range, there are few below 10% at the high quality level.
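A per-group summary like the one above can be sketched with base R's `by()`, and `tapply()` gives the group medians as a plain vector. The sketch uses only the first ten alcohol and quality values printed by `str()` earlier, so the groups are much smaller than in the full data and quality levels 3, 4, and 8 do not appear.

```r
# First ten rows, copied from the str() output in this report.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# One summary() of alcohol per quality level.
print(by(alcohol, quality, summary))

# Median alcohol per quality level as a named vector.
median_by_quality <- tapply(alcohol, quality, median)
print(median_by_quality)
```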

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I used a scatterplot matrix to look for relationships between all the chemical properties. I also looked individually at how each one relates to the quality of the wine. I found only hints about how these chemicals produce a good quality wine; quality seems to depend on their combination rather than on any single one. However, I found that alcohol content increases with the quality of the wine.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

1: There is a good correlation (0.668) between fixed acidity and density.

2: There is a good negative correlation (-0.683) between fixed acidity and pH.

3: There is a good correlation (0.668) between Total SO2 and free SO2.

4: There is a good correlation (0.672) between fixed acidity and citric.acid.

5: Quality and level of alcohol

What was the strongest relationship you found?

The fixed acidity with the pH

Multivariate Analysis

Good correlation between pH and fixed acidity: low pH corresponds to high fixed acidity. When density increases, fixed acidity increases too, but pH does not.

Good correlation between alcohol and density. This correlation looks tighter for higher quality wines, although there is still a lot of dispersion. Wines with more alcohol usually have a higher pH.

Consider wines with the following properties: alcohol > 10%, sulphates > 0.70, pH < 3.3, chlorides < 0.075, volatile.acidity < 0.4, and total.sulfur.dioxide < 40. The first table below counts all wines per quality range, and the second counts only the wines that meet these conditions.

##  (0,5]  (5,6] (6,10] 
##    744    638    217
##  (0,5]  (5,6] (6,10] 
##      2     11     27

With this filter, the share of wines rated 7 or 8 rises from 13.6% (217 of 1,599) to 67.5% (27 of 40).
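The filter and the two percentages can be sketched as below. The `subset()` call applies the six conditions to the first ten rows printed by `str()` earlier (`reds_head` and `selected` are hypothetical names). None of those ten wines happens to pass, so the shares are computed from the counts reported in the two tables.

```r
# First ten rows, copied from the str() output in this report.
reds_head <- data.frame(
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,
                           0.065, 0.073, 0.071),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)

# Keep only rows that meet all six conditions at once.
selected <- subset(
  reds_head,
  alcohol > 10 & sulphates > 0.70 & pH < 3.3 & chlorides < 0.075 &
    volatile.acidity < 0.4 & total.sulfur.dioxide < 40
)
print(nrow(selected))

# Share of the (6,10] bucket before and after filtering, from the reported counts.
before <- c("(0,5]" = 744, "(5,6]" = 638, "(6,10]" = 217)
after  <- c("(0,5]" = 2,   "(5,6]" = 11,  "(6,10]" = 27)
share_before <- round(100 * before[["(6,10]"]] / sum(before), 1)
share_after  <- round(100 * after[["(6,10]"]] / sum(after), 1)
print(c(before = share_before, after = share_after))
```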

Based on this, I fit linear models to predict the quality of a wine from the amounts of the different chemicals it contains.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = reds)
## m2: lm(formula = quality ~ alcohol + pH, data = reds)
## m3: lm(formula = quality ~ alcohol + pH + sulphates, data = reds)
## m4: lm(formula = quality ~ alcohol + pH + sulphates + volatile.acidity, 
##     data = reds)
## m5: lm(formula = quality ~ alcohol + pH + sulphates + volatile.acidity + 
##     total.sulfur.dioxide, data = reds)
## 
## ==============================================================================================
##                              m1            m2            m3            m4            m5       
## ----------------------------------------------------------------------------------------------
##   (Intercept)               1.875***      4.426***      3.345***      3.493***      3.749***  
##                            (0.175)       (0.387)       (0.401)       (0.385)       (0.387)    
##   alcohol                   0.361***      0.386***      0.367***      0.321***      0.308***  
##                            (0.017)       (0.017)       (0.017)       (0.016)       (0.017)    
##   pH                                     -0.850***     -0.635***     -0.306**      -0.319**   
##                                          (0.116)       (0.116)       (0.115)       (0.115)    
##   sulphates                                             0.868***      0.635***      0.667***  
##                                                        (0.104)       (0.102)       (0.102)    
##   volatile.acidity                                                   -1.156***     -1.130***  
##                                                                      (0.100)       (0.100)    
##   total.sulfur.dioxide                                                             -0.002***  
##                                                                                    (0.001)    
## ----------------------------------------------------------------------------------------------
##   R-squared                 0.227         0.252         0.283         0.339         0.347     
##   adj. R-squared            0.226         0.251         0.282         0.337         0.345     
##   sigma                     0.710         0.699         0.684         0.657         0.654     
##   F                       468.267       268.888       210.183       204.210       169.271     
##   p                         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1721.057     -1694.466     -1660.297     -1595.858     -1585.956     
##   Deviance                805.870       779.508       746.896       689.059       680.578     
##   AIC                    3448.114      3396.931      3330.594      3203.717      3185.913     
##   BIC                    3464.245      3418.440      3357.480      3235.980      3223.553     
##   N                      1599          1599          1599          1599          1599         
## ==============================================================================================
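A comparison of nested linear models like m1 to m5 can be sketched in base R as follows. The data here are simulated: the data frame `sim`, the coefficients used to generate it, and the noise level are all made up for illustration, so the printed values say nothing about the wine data. The table above appears to come from a model-table helper, whereas this sketch collects R-squared and AIC by hand.

```r
set.seed(42)  # reproducible synthetic data
n <- 200

# Hypothetical predictors drawn over ranges similar to those in the summary.
sim <- data.frame(
  alcohol   = runif(n, 8.4, 14.9),
  pH        = rnorm(n, mean = 3.31, sd = 0.15),
  sulphates = runif(n, 0.33, 1.0)
)
# Made-up generating model plus noise; not estimated from the wine data.
sim$quality <- 2 + 0.35 * sim$alcohol - 0.5 * sim$pH +
  0.8 * sim$sulphates + rnorm(n, sd = 0.7)

# Nested models: each one adds a predictor to the previous one.
m1 <- lm(quality ~ alcohol, data = sim)
m2 <- update(m1, . ~ . + pH)
m3 <- update(m2, . ~ . + sulphates)

models <- list(m1 = m1, m2 = m2, m3 = m3)
r2  <- sapply(models, function(m) summary(m)$r.squared)
aic <- sapply(models, AIC)
print(round(rbind(r2, aic), 3))
```

R-squared can never decrease when a predictor is added to a nested least-squares model, which is why adjusted R-squared, AIC, and BIC are also worth reading in the table above.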

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

-Low fixed acidity goes with higher pH, as we supposed

-A high quality wine should have low pH, low chlorides, low volatile.acidity, and low density, together with high citric acid, alcohol, and sulphates.

Were there any interesting or surprising interactions between features?

Alcohol is directly related to the quality of the wine. Fixed acidity is related to pH.

OPTIONAL: Did you create any models with your dataset? Discuss the

I created a linear model to predict the quality of a wine from the chemical variables, but R^2 = 0.347 is too low. There is too much dispersion to reach a conclusion based only on the chemicals and the differing tastes of as few as 3 experts.

Final Plots and Summary

Plot One

## reds$quality.bucket: (0,5]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.400   9.700   9.926  10.300  14.900 
## -------------------------------------------------------- 
## reds$quality.bucket: (5,6]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## reds$quality.bucket: (6,10]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00
## reds$quality.bucket: (0,5]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.200   3.310   3.312   3.400   3.900 
## -------------------------------------------------------- 
## reds$quality.bucket: (5,6]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## reds$quality.bucket: (6,10]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.270   3.289   3.380   3.780

Description One

The alcohol content is important when considering a high quality wine, as we can see in the previous figure. In the low quality range (quality of 5 or below), practically no wines have more than 12% alcohol, and wines with less than 10% are the majority. On the other side, wines with more than 12% alcohol are abundant among wines rated 7 or 8, while there are almost no wines with less than 10%. pH = 3.3 (black solid line) stays more or less in the middle of all distributions. We can see some displacement towards a lower pH as we go to high quality wines.

Plot Two

Description Two

In this plot we can see how the median alcohol content evolves as a function of quality. A high quality wine should, in general, have a high level of alcohol. However, the dispersion is large.

With the filter from the multivariate analysis, the share of wines rated 7 or 8 goes from 13.6% to 67.5%.

Plot Three

Description Three

Low fixed acidity goes with high pH (blue dots), as we could have expected. In addition, the previous figure shows that density increases as fixed acidity increases. The correlation between fixed acidity and pH (-0.683) is the strongest I found in the dataset.

Reflection

I found several correlations between different chemical properties of the wine, and I could understand, in general, which of them relate directly to its quality. My main struggle was that, since I do not know much about these chemicals (more literature research would have helped), I started the analysis quite blind. Only alcohol, pH, and quality made sense to me. I also wanted a model to predict the quality of the wine from the amounts of the chemicals, but the fit turned out to be quite poor (R ≈ 0.6, R^2 = 0.347).

-I think having more people rate each wine would have given better results.

-Other variables, like the age of the wine, could also be important.